ABSTRACT


EXCESS  SAMPLE  SIZE  AND  THE  ‘DELTA  WOBBLE’  IN  RANDOMIZED 
CONTROLLED  TRIALS 

Michael  A.  Fischer  and  Alvan  R.  Feinstein.  Department  of  Internal  Medicine,  Yale 
University,  School  of  Medicine,  New  Haven,  CT. 

To  determine  the  occurrence  and  consequences  of  excess  sample  sizes  in  large 
randomized  controlled  trials,  we  reviewed  158  randomized  controlled  trials,  each 
containing  more  than  100  patients,  published  in  Lancet,  Journal  of  the  American  Medical 
Association,  and  New  England  Journal  of  Medicine  during  the  three  years  1990-1992. 

Of  98  trials  with  statistically  significant  differences  between  control  and  experimental 
groups,  the  reported  P  values  were  less  than  0.001  in  27  (28%)  and  less  than  0.01  in  an 
additional 35 (36%). Since sample sizes are usually calculated to provide P values of
0.05, the occurrence of values less than 0.01, and particularly below 0.001, suggests either
that sample size was excessive or that the investigators found differences much larger than
δ (the anticipated difference for “clinical importance”). The original anticipations were
difficult to determine, however, because sample size calculations were not reported
consistently: among the 158 trials, the details were presented completely in 78 (49%), but
wholly omitted in 58 (37%). Of 54 trials that stated the value of δ and that found
“statistical significance,” 31 had P values below 0.01, but only 10 of these trials had
observed differences that were at least 25% larger than δ; in the remaining 21, the small P
value was attained only by excess sample size. On the other hand, 15 trials found P<0.05
and claimed statistical significance although the observed difference was at least 25%
smaller than δ.

The  problems  of  excessive  sample  size  (and  resources)  probably  arise  from  the 
customary  Neyman-Pearson  strategy,  which  tries  to  satisfy  two  (contradictory)  statistical 
hypotheses,  thereby  making  sample  size  much  larger  than  what  is  needed  for  a  single  null 
hypothesis.  The  excessive  sample  size  may  then  allow  “statistical  significance”  to  be 
found  and  emphasized  for  differences  much  smaller  than  what  was  originally  anticipated. 
I. INTRODUCTION

“The value for which P = .05, or 1 in 20, ... is convenient ...
as  a  limit  in  judging  whether  a  deviation  is  to  be  considered 
significant  or  not.”  1,p  44 

Sir  Ronald  Fisher  originally  made  the  above  definition  in  1925.  Today  the 
threshold  of  P  below  0.05  remains  the  principal  criterion  for  statistical  significance,  often 
representing  the  difference  between  research  that  changes  current  practice  and  research 
that  does  not.  In  many  clinical  trials,  however,  the  authors  report  P  values  that  are  much 
smaller  than  0.05.  Very  small  P  values  can  arise  either  because  the  difference  in  the  event 
rates  reported  is  much  larger  than  originally  anticipated  or  because  the  sample  size  was 
excessively  large. 

This research was aimed at documenting the frequency with which randomized
controlled trials in major medical journals report very small P values, and at suggesting
possible reasons for the phenomenon. The research data were obtained by reviewing the
sample  size  calculations  and  the  subsequent  results  described  in  the  published  trials. 

The  remainder  of  this  section  contains  a  review  of  previous  literature  on  sample 
size  calculation  in  randomized  controlled  trials.  Section  II  describes  the  methods  used  to 
assemble  a  group  of  randomized  controlled  trials  for  review  and  to  abstract  data  from 
those  articles.  Section  III  presents  the  main  results  of  the  study,  including  the  range  of  P 
values  in  the  reviewed  articles,  the  extent  of  reporting  for  sample  size  calculations,  and  the 
differences  between  the  reported  and  the  originally  anticipated  event  rates.  Section  IV 
discusses the implications of the results in Section III and shows some illustrative
examples  of  calculations  of  sample  size.  Section  V  contains  the  conclusions. 

A.  Literature  Review  of  Studies  Reporting  on  Sample  Size  Calculation 

Previous  studies  of  the  reporting  of  sample  size  calculation  in  randomized 
controlled  trials  have  repeatedly  shown  that  most  authors  do  not  report  the  details  of  their 
sample  size  calculations.  In  1978  Ambroz  et  al.  reviewed  172  randomized  controlled  trials 
and  found  that  none  of  the  publications  reported  a  sample  size  calculation  2.  A  1982 
review  of  67  randomized  controlled  trials  found  that  12%  of  the  articles  reported  the 
details  of  the  sample  size  calculation  and  3%  provided  partial  information  about  the 
sample size calculation 3. In 1986, a survey of the breast cancer literature by Liberati et al.
revealed that 20 (32%) of 63 articles had reported sample size calculations in the text 4. In
follow-up phone calls, 13 more sets of authors provided the details of sample size
calculation that had not been presented in the text 4. In a 1987 review, 5 (11%) of 45
articles reported sample size calculations 5. In 1990 Altman and Dore found some
improvement in completeness of reporting: 31 (39%) of 80 trials reported details of the
sample size calculations and only 27 (34%) of 80 articles made no mention of advance
consideration of sample size 6.

Two  large  literature  reviews  in  1994  showed  that  reporting  of  sample  size 
calculation  methodology  had  become  more  widespread  than  in  1978  but  was  still  far  from 
universal. In a review of the obstetrics and gynecology literature from 1990 and 1991,
Schulz and co-authors found that only 50 (24%) of 206 articles reported sample size
calculations 7. These authors also found considerable variation between journals in the
extent of reporting of sample size calculation 7. Moher et al., in a sample of articles from
1975 to 1990, found that 33 (32%) of 103 reported sample size calculation 8. The
proportion  of  articles  reporting  sample  size  calculation  had  increased  over  time,  from  0% 
in  1975  to  43%  in  1990  8. 


B.  Studies  Arguing  for  Larger  Sample  Sizes 


The attempt to determine the proper sample size for a randomized controlled trial
can be seen as both a practical and an ethical concern:

A  study  with  an  overlarge  sample  may  be  deemed  unethical  through  the 
unnecessary  involvement  of  extra  subjects  and  the  correspondingly 
increased  costs  ...  On  the  other  hand,  a  study  with  a  sample  that  is  too 
small  will  be  unable  to  detect  clinically  important  effects.  Such  a  study  may 
be  scientifically  useless,  and  hence  unethical  in  its  use  of  subjects  and  other 
resources.  9,p  1336 


Over  the  last  two  decades,  the  predominant  argument  made  in  the  medical  literature  has 
been  for  larger  sample  sizes. 


The  importance  of  large  samples  to  avoid  type  II  error,  or  false  negative 
conclusions,  in  randomized  trials  was  brought  to  prominence  by  Freiman  et  al.  10  in  an 
influential article, nearly 20 years ago, which noted that many trials reporting no difference
between control and experimental groups were in fact not able to rule out differences as
large as 25 or even 50 percent. Freiman et al. urged that much larger sample sizes would
be needed to state conclusively that there was no difference between control and
experimental groups. In a recent article that revisited the issue originally raised by
Freiman et al., Moher et al. found that many clinical trials reporting no difference still did
not have samples large enough to exclude effects of 25 or 50 percent 8.

On  the  other  hand,  during  informal  reviews  of  the  literature,  several  readers  had 
noted  extremely  small  P  values,  suggesting  that  sample  sizes  were  substantially  larger  than 
needed  to  achieve  the  boundary  of  0.05.  The  current  study  was  evoked  by  questions 
about  the  frequency  and  sources  of  the  “too-large-sample”  phenomenon. 


II. METHODS


A.  Assembling  the  Articles 

For  this  review,  I  chose  randomized  controlled  trials  published  in  Lancet,  New 
England  Journal  of  Medicine,  and  Journal  of  the  American  Medical  Association  during 
the  three  year  period  from  1990  through  1992,  inclusive.  This  choice  follows  the  method 
of  two  prior  studies.  In  one  of  the  most  widely  cited  reviews  of  statistical  methods  in  the 
medical  literature,  Freiman  et  al.  10  examined  articles  from  several  journals,  but  more  than 
one-half  of  the  articles  came  from  these  three  journals.  In  their  later  review  of  the  same 
topics,  Moher  et  al.  8  also  examined  articles  in  those  same  three  journals.  With  the 
emphasis on large clinical trials, I restricted my search to articles with a sample size of at
least  100  patients. 

Table  1  summarizes  the  literature  search  and  the  criteria  for  exclusion  of  articles. 
Using the Medline computer program in the summer of 1994, I restricted the search to
“Clinical Trial,” “Multicenter Study,” and “Randomized Controlled Trial.” The search
produced  1003  articles,  which  were  reviewed  to  determine  appropriateness  for  this  study. 

I  made  certain  simple  exclusions  by  inspecting  the  abstracts  cited  by  Medline,  but  other 
exclusions  required  review  of  the  text  of  the  articles.  The  search  produced  many  articles, 
including  letters  (276),  editorials  (56),  reviews  (8),  meta-analyses  (9),  and  news 
summaries  (16),  all  of  which  I  excluded.  The  238  articles  that  described  trials  containing 
fewer  than  100  patients  were  also  excluded,  as  well  as  102  articles  that  were  not 
randomized controlled trials, having been obtained via the headings “Clinical Trial” and
“Multicenter Study.” Additional exclusions were 84 trials whose primary outcome
measure  was  not  a  rate  or  proportion,  21  trials  that  were  designed  to  show  equivalence 
(rather than efficacy) between control and experimental groups, 11 trials that were
designed  to  demonstrate  vaccine  efficacy,  eight  trials  that  had  multi-stage  randomization 
schemes  or  other  complexities  that  made  them  inappropriate  for  this  analysis,  and  six  that 
were  follow-up  cohort  studies  of  patients  from  prior  randomized  controlled  trials. 
Appendix  A  lists  full  citation  information  on  the  included  articles. 


Table 1: Details of literature search

                                NEJM   Lancet    JAMA   Total
Articles identified              396      490     117    1003
Excluded because:
  Letter to editor                82      190       4     276
  N<100                           84      131      23     238
  Not rand. controlled trial      36       50      16     102
  Not measuring event rate        38       30      26      84
  Editorial                       47        8       1      56
  Trial for equivalence           14        6       1      21
  “News”                           1        2      13      16
  Vaccine trials                   4        7       0      11
  Meta-analysis                    2        1       6       9
  Review article                   1        6       1       8
  Complex randomization            3        3       2       8
  Follow-up studies                0        2       4       6
TOTAL EXCLUDED                   312      436      97     845
TOTAL KEPT                        84       54      20     158

After the exclusions cited in Table 1, the remaining 158 articles were each
reviewed  and  suitably  excerpted  for  descriptions  of  the  sample  size  calculations.  For  trials 
that  reported  a  statistically  significant  difference  between  experimental  and  control  groups, 
the  magnitudes  of  the  main  difference  between  groups,  and  the  corresponding  P  value, 
were  recorded.  The  remainder  of  this  section  describes  the  recorded  components  of  the 
sample  size  calculation. 

B. Introduction to Neyman-Pearson Equation

The most widely accepted method for calculating sample size is the Neyman-Pearson
equation, shown below:

$$ n = \frac{(Z_\alpha + Z_\beta)^2 \times [2 \times \pi_c \times (1 - \pi_c)]}{\delta^2} \qquad (1) $$

In this equation, n represents the number of subjects that will be required in each of two
groups. Zα represents the Z-score that corresponds to the designated value of α, which is
the risk of type I error that the authors are willing to accept. Zβ represents the Z-score
that corresponds to the designated value of β, which is the risk of a type II error. πc
represents the estimated value for the event rate expected in the control group, and the
quantity [2 × πc × (1 − πc)] represents the variance of that rate. δ represents the anticipated
difference that the authors hope to find between rates in the control and experimental
groups.
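
Equation 1 can be evaluated directly. The following is a minimal sketch in Python, not the procedure used by any of the reviewed trials: the function names are hypothetical, β is assumed to be one-tailed by default, the result is rounded up to a whole subject, and the design values in the example (πc = 0.30, δ = 0.10) are invented for illustration only.

```python
from math import ceil
from statistics import NormalDist


def z_for_tail(prob: float, two_tailed: bool) -> float:
    """Gaussian Z-score that leaves `prob` in one tail, or split over two tails."""
    tail = prob / 2 if two_tailed else prob
    return NormalDist().inv_cdf(1 - tail)


def neyman_pearson_n(pi_c: float, delta: float,
                     alpha: float = 0.05, beta: float = 0.10,
                     alpha_two_tailed: bool = True,
                     beta_two_tailed: bool = False) -> int:
    """Subjects required in each group according to Equation 1."""
    z_alpha = z_for_tail(alpha, alpha_two_tailed)
    z_beta = z_for_tail(beta, beta_two_tailed)   # assumption: one-tailed beta
    variance = 2 * pi_c * (1 - pi_c)             # the bracketed term of Equation 1
    return ceil((z_alpha + z_beta) ** 2 * variance / delta ** 2)


# Hypothetical design values, chosen only for illustration.
n_per_group = neyman_pearson_n(pi_c=0.30, delta=0.10)
print(n_per_group)
```

Because δ enters as a squared term in the denominator, halving the anticipated difference roughly quadruples the required number of subjects in each group.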
The Neyman-Pearson sample size calculation requires two basic decisions: the first
is defining the levels of significance that will be used as cutoff points for α and β, and the
second  is  estimating  event  rates.  If  both  of  these  decisions  are  fully  described,  a  reader 
can  replicate  the  sample  size  calculation  and,  aware  of  statistical  assumptions  made  in  trial 
design,  can  understand  the  importance  of  the  subsequently  reported  P  value. 

The level of significance for rejecting the null hypothesis is defined by the
designation of α, typically 0.05 and traditionally two-tailed. It corresponds to a Gaussian
Zα value of 1.96. The α value of 0.05 implies a 5% risk of a type I error, i.e. of rejecting a
true null hypothesis when the observed finding arises from chance alone. The risk of
type II error, in rejecting a true alternate hypothesis of a large difference between groups,
is defined by β, which can have various values, but is often designated at 0.10. β can be
either one- or two-tailed. The assigned value of β is often stated implicitly as the power
of the study, which is calculated as 1 − β, so that the most commonly assigned power for a
study is 90%.

For the studies that provided a sample size calculation, I recorded whether the α
or β designations were described, and whether they were listed as one- or two-tailed.
Presentation of the power of a study was considered equivalent to presenting the value of β.

The other important decision in sample size calculation is a prior estimation of
event rate in the control group (πc) and the change (δ), i.e. delta, that the authors believe
would represent a clinically significant finding. In Equation 1 the required components are
the variance [2πc(1 − πc)] of the rate in the control group in the numerator and δ in the
denominator. Many authors do not present both of these designations. When authors
present only the δ that they hoped to find, it is helpful for reviewing the final outcome of
the trial, but does not provide enough information for the reader to re-create the sample
size calculation. Many authors do not cite the absolute difference which they hoped to
find for δ, but instead describe θ, the desired proportional (or relative) change, which
would usually be calculated as δ/πc. The presentation of only θ gives some information
about the authors’ assumptions, but does not allow re-creation of the sample size
calculation. The sample size calculation can be replicated only if authors provide their
prior designation of πc together with any citation of πe (the anticipated rate in the
experimental group), δ, or θ.
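
The replication rule in the preceding paragraph can be written as a short check. This is only a sketch under stated assumptions: the function name is hypothetical, θ is taken to equal δ/πc as described above, and the numerical values in the example are invented. The check covers the event rates alone; the designations of α and β are also needed for a full replication.

```python
from typing import Optional, Tuple


def recover_pi_c_and_delta(pi_c: Optional[float] = None,
                           pi_e: Optional[float] = None,
                           delta: Optional[float] = None,
                           theta: Optional[float] = None
                           ) -> Optional[Tuple[float, float]]:
    """Return (pi_c, delta) if the reported items suffice for Equation 1, else None."""
    if pi_c is None:
        return None                       # delta, theta, or pi_e alone is not enough
    if delta is not None:
        return pi_c, delta                # absolute difference stated directly
    if pi_e is not None:
        return pi_c, abs(pi_c - pi_e)     # delta by subtraction
    if theta is not None:
        return pi_c, theta * pi_c         # delta by multiplication (theta = delta / pi_c)
    return None


# Hypothetical reports, for illustration only.
print(recover_pi_c_and_delta(pi_c=0.40, theta=0.25))   # replicable
print(recover_pi_c_and_delta(theta=0.25))              # not replicable
print(recover_pi_c_and_delta(delta=0.10))              # not replicable
```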

For the articles that provided sample size calculations, I recorded which of these
features were reported.
III. RESULTS


A.  Range  of  P  Values 

Of  the  158  articles  in  this  sample,  104  reported  statistically  significant  differences 
between  the  control  and  experimental  groups.  Of  those  104  articles,  the  six  that  did  not 
use  P  values  in  discussing  the  results  were  not  included  in  this  section.  Table  2  shows  that 
among  the  98  articles  with  statistically  significant  outcomes,  27  (28%)  reported  P  values 
less than or equal to 0.001, 35 articles (36%) had P values that were between 0.01 and
0.001,  and  the  remaining  36  articles  reported  P  values  between  0.05  and  0.01. 

Given  that  P<0.05  is  the  commonly  accepted  threshold  for  statistical  significance, 
it  seems  surprising  that  over  25%  of  the  articles  reported  P  values  50  times  smaller  than 
the  threshold  value  (i.e.  <0.001).  Two  possible  explanations  could  account  for  these 
extremely small P values: the observed difference (hereafter referred to as do) found to be
statistically significant might have been much larger than the difference (δ) that the authors
expected to find; alternatively, the number of patients in the trials might have been much
larger than needed to achieve significance at the P<0.05 level*. To assess the frequency
of these explanations, the original sample size calculations must be examined to determine
the event rates that were estimated when the trial was designed.


* Section III.C will point out a third explanation, altered variance, for small P values, and the example
in Section IV.B.3 will expand on that explanation.


Table 2: Range of P values in trials with statistically significant outcomes

P value            Number of articles (%)
P<0.001                 27 (28%)
0.001<P<0.01            35 (36%)
0.01<P<0.05             36 (37%)



B.  Sample  Size  Calculations 

1.  Presentation  of  Sample  Size  Calculations 

Table  3  summarizes  the  ways  in  which  authors  reported  their  sample  size 
calculations.  Of  the  158  large,  randomized  controlled  trials  under  analysis,  58  (37%)  were 
reported  with  no  description  of  how  sample  size  was  calculated  and  with  no  reference  to  a 
previously  published  calculation.  Of  100  (63%)  that  provided  at  least  some  description  of 
the  calculation,  12  required  reference  to  a  prior  publication  to  find  some  or  all  of  the  main 
components in the sample size calculation. The cited 100 articles were stratified into the
several  groups  shown  in  Table  3.  The  first  column  shows  that  78  reports  provided  prior 
designations  of  event  rates.  The  remaining  22  articles  provided  less  detailed  descriptions 
that  would  limit  a  reader’s  ability  to  fully  understand  the  assumptions  that  went  into  the 
sample  size  calculation. 

The rows of Table 3 show the extent to which authors noted their prior
designations of α and β. For the sake of simplicity this table does not include whether
authors indicated if their designations of α and β were one- or two-tailed. Almost half of
the authors who presented a value of α classified it as one- or two-tailed (42/87), while
very few authors noted whether their values of β were one- or two-tailed (6/91). The first
row shows that 83 articles (53% of the total sample) presented both α and β designations.
Only four authors (3%) presented only α values (2nd row) and eight (5%) reported only β
values (3rd row). The fourth row of Table 3 shows that in addition to the 58 articles that
did not describe any sample size calculation, five others (with some description) did not
include their designations of either α or β.

Table 3: Elements reported for sample size calculation

Columns:
(1) Designation of πc and δ described*
(2) Designation of δ described, but πc not described
(3) Designation of θ described, but neither πc nor δ described
(4) No event rate designations described

                                         (1)    (2)    (3)    (4)   TOTAL
Designations of α and β described         67      5      9      2      83
Designation of α alone described           2      1      0      1       4
Designation of β alone described           4      1      3      0       8
Neither α nor β designation described      5      0      0     58      63
TOTAL                                     78      7     12     61     158

* πe or θ were acceptable substitutes for δ

The columns of Table 3 show the frequency with which authors reported their
prior estimations of outcome rates in the control and experimental groups. As noted
earlier, proper interpretation of results requires knowledge of πc and of δ, although the
value of δ could be calculated by subtraction if πc and πe are described, or by appropriate
multiplication if πc (or πe) and θ are described. The first column shows that 78 articles
(49% of the total sample) either provided both πc and δ, or provided πc and enough
information for δ to be easily calculated. Smaller numbers of articles presented either δ
alone (2nd column, 7/158, 4%) or θ alone (3rd column, 12/158, 8%). No articles in the
sample reported πc alone without some indication of the desired change, but the fourth
column shows that, beyond the 58 articles with no information on the sample size
calculation, only three articles gave no indication of the estimated event rates or desired
differences.

2.  Completeness  of  Reporting  for  Sample  Size  Calculation 

As noted earlier, a complete description of the sample size calculation would allow
understanding of the methods used by the investigators. The 67 trials in the upper left-hand
corner of Table 3 provided complete or near-complete descriptions of sample size
calculations, limited only by inconsistent reporting of whether α and β were one- or
two-tailed (four articles reported this information for both α and β). Similarly, for the 11
articles in the rest of the first column of Table 3, values of πc and δ are described
(although α and β are not both reported) so that a reader could appreciate the connotation
of the P values reported by the authors.

For  the  19  articles  in  the  middle  two  columns  of  Table  3,  the  reader  would  be 
hard-pressed  to  understand  the  sample  size  calculation.  A  determined  reader  could  make 
multiple guesses at the designation of πc, perhaps getting a sense of the prior estimates, but
the process would be quite laborious. For the 61 articles in the right-hand column of
Table  3,  there  is  no  way  for  a  reader  to  understand  the  sample  size  calculation,  especially 
for  the  58  articles  that  provided  no  details  at  all. 

In  summary,  only  4  of  the  158  articles  (3%)  reported  all  of  the  information  needed 
to  understand  the  sample  size  calculation,  but  an  additional  63  articles  (40%)  offered 
almost  complete  information.  Eleven  articles  (7%)  provided  incomplete  information  but 
described  the  critical  rates  that  would  allow  the  reader  to  evaluate  the  outcome  of  the  trial. 
Nineteen  additional  articles  (12%)  offered  information  that  might  allow  for  a  general  sense 
of the sample size calculation, but was too limited for full evaluation of the results.
Sixty-one articles (39%) did not provide information that would allow any realistic attempt at
understanding the sample size calculation. There was no correlation between the actual
sample sizes in the trials and the extent of reporting of the sample size calculations.
Overall, 78 of the articles (49%) provided enough information for readers to understand
the  important  components  of  the  sample  size  calculation. 

For  the  78  articles  which  presented  their  prior  designations  of  event  rates,  if 
significant differences were found, readers could evaluate the magnitude of the P values
reported. Table 4 shows that of the 158 total articles reviewed, 54 both described πc and
δ and reported statistically significant outcomes. The following sections will compare
prior  estimates  and  reported  results  for  these  54  articles. 

C. Example of P Value Calculation

The P value used to determine statistical significance is based on a Z-score derived
from the following equation:

$$ Z = \frac{d_o}{\mathrm{SED}} = \frac{p_c - p_e}{\sqrt{p\,(1 - p)\left(\frac{1}{n_c} + \frac{1}{n_e}\right)}} \qquad (2) $$

In this equation, the numerator, do, is equal to pc (the outcome rate in the control group)
minus pe (the outcome rate in the experimental group). The denominator for this
calculation is the standard error of the difference between groups (SED). For its
calculation, p is the average outcome rate (i.e. the average of pc and pe), nc is the number
of patients in the control group, and ne is the number of patients in the experimental
group.
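
The calculation just described can be sketched in Python. The sketch assumes that Equation 2 takes the usual form Z = do/SED, with SED = sqrt(p(1 − p)(1/nc + 1/ne)), and that p is the pooled rate weighted by group size, which equals the simple average of pc and pe when the groups are of equal size. The names and the trial results in the example are hypothetical.

```python
from math import sqrt
from statistics import NormalDist
from typing import NamedTuple


class ZTest(NamedTuple):
    d_o: float            # observed difference, pc - pe
    sed: float            # standard error of the difference
    z: float              # Z-score, d_o / SED
    p_two_tailed: float   # two-tailed P value for that Z-score


def z_test_for_rates(p_c: float, p_e: float, n_c: int, n_e: int) -> ZTest:
    """Z-score and two-tailed P value for a difference between two rates."""
    d_o = p_c - p_e
    # Assumption: pooled rate weighted by group size.
    p_bar = (p_c * n_c + p_e * n_e) / (n_c + n_e)
    sed = sqrt(p_bar * (1 - p_bar) * (1 / n_c + 1 / n_e))
    z = d_o / sed
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return ZTest(d_o, sed, z, p_value)


# Hypothetical trial results, for illustration only.
result = z_test_for_rates(p_c=0.30, p_e=0.20, n_c=442, n_e=442)
print(result)
```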
Table 4: Criteria for inclusion of articles in analysis of δ versus do

                                         Articles reporting a statistically
                                         significant do with a P value
                                           Yes      No    TOTAL
Articles that reported both πc and δ        54      24       78
Articles that did not report both
πc and δ                                    44      36       80
TOTAL                                       98      60      158

Higher Z-scores correspond to lower P values; for example, a Z-score of 1.645
corresponds to a two-tailed P value of 0.10 while a Z-score of 1.96 yields the familiar
two-tailed P value of 0.05. Equation 2 shows that a Z-score could increase in three ways.
An increase in do would enlarge the numerator; alternatively, either an increase in nc or ne
or a decrease in the variance would reduce the denominator. The next section will
compare the range of observed do values and the anticipated δ values in this group of
articles.
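
The effect of group size alone can be illustrated with a short sketch. It assumes the usual pooled form of the standard error, SED = sqrt(p(1 − p)(1/nc + 1/ne)); the function names and the rates (pc = 0.30, pe = 0.25) are hypothetical. The observed difference is held fixed at do = 0.05 while only the number of patients per group changes.

```python
from math import sqrt
from statistics import NormalDist


def z_score(p_c: float, p_e: float, n_c: int, n_e: int) -> float:
    """Z = d_o / SED, with SED built from the pooled outcome rate."""
    p_bar = (p_c * n_c + p_e * n_e) / (n_c + n_e)
    sed = sqrt(p_bar * (1 - p_bar) * (1 / n_c + 1 / n_e))
    return (p_c - p_e) / sed


def two_tailed_p(z: float) -> float:
    """Two-tailed Gaussian P value for a given Z-score."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical trial: same observed rates, increasing group sizes.
results = []
for n in (200, 800, 3200):
    z = z_score(p_c=0.30, p_e=0.25, n_c=n, n_e=n)
    results.append((n, z, two_tailed_p(z)))
    print(f"n per group = {n:5d}   Z = {z:5.2f}   P = {two_tailed_p(z):.6f}")
```

In this illustration the same do is not significant at the smallest group size, passes the 0.05 threshold at the middle one, and falls below 0.001 at the largest.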

D. Differences Between do and δ

Table 5 shows that, of the 54 articles which reported both prior designations of event rates
and statistically significant outcomes, the observed value of do was greater than or equal to
the prior assignment of δ in 29 cases, but 25 articles reported statistical significance for a
do that was smaller than the prior designation of δ. We have named this latter
phenomenon “δ (delta) wobble” and will explain the term more completely later in the
Discussion. In four extreme instances, the observed do was at least 50% smaller than the
anticipated δ. As an example of “δ wobble,” in one of the articles 11 included in the
second-to-last row of Table 5, a δ value of 0.30 was designated for the purposes of
sample size calculation, but a statistically significant do of 0.152 was presented in the
results section. Thus for this article do was smaller than δ by 49% [i.e. (do − δ)/δ =
(0.152 − 0.30)/0.30 = −0.49]. At the other extreme, in one of the articles 12 included in the
first row of Table 5, a δ of 0.20 was designated for sample size calculation, but a do of
0.37 was presented in the results section. For this article do was greater than δ by 85%
[(0.37 − 0.20)/0.20 = 0.85].

Table 5: Frequency of values for the proportionate difference (do − δ)/δ
(Negative percentage for do<δ, positive percentage for do>δ)

Percent difference between do and δ    Number of articles
>50%                                          10
25% to 50%                                     6
0% to 25%                                     12
0%                                             1
-25% to 0%                                    10
-50% to -25%                                  11
<-50%                                          4
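
The proportionate difference used in Table 5 can be computed as in the following sketch, which reproduces the two worked examples above. The function names are hypothetical, and the handling of values that fall exactly on a row boundary is an assumption, since the text does not state it.

```python
def proportionate_difference(d_o: float, delta: float) -> float:
    """(d_o - delta) / delta: negative when d_o < delta, positive when d_o > delta."""
    return (d_o - delta) / delta


def table5_row(pd: float) -> str:
    """Row of Table 5 for a proportionate difference (boundary handling assumed)."""
    if pd > 0.50:
        return ">50%"
    if pd > 0.25:
        return "25% to 50%"
    if pd > 0:
        return "0% to 25%"
    if pd == 0:
        return "0%"
    if pd >= -0.25:
        return "-25% to 0%"
    if pd >= -0.50:
        return "-50% to -25%"
    return "<-50%"


# The two examples quoted in the text: (d_o, delta).
for d_o, delta in ((0.152, 0.30), (0.37, 0.20)):
    pd = proportionate_difference(d_o, delta)
    print(f"d_o = {d_o}, delta = {delta}: {pd:+.2f} ({table5_row(pd)})")
```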

The final row of Table 5 shows that of the 25 articles with “δ wobble,” i.e. a do
value less than the prior designation of δ, four presented a statistically significant do that
was less than half as large as δ. An additional 11 articles presented do values
that were smaller than δ by a proportionate increment between one-quarter and one-half
(second-to-last row). Seven of the 25 articles with “δ wobble” presented a statistically
significant do value that was less than δ by an absolute increment of more than 0.10.

Table 6 shows the relationship of discrepancies between do and δ to the magnitude
of P values reported. As noted previously, increased do values could cause very small P
values. The first two rows of Table 6 show that of the 31 trials which reported δ and had
P values less than or equal to 0.01, 10 had do values more than 25% larger than the prior
designation of δ, but 12 of the 31 trials with P<0.01 had do values that were smaller than
the prior designation of δ. Table 6 demonstrates that the very small P values are not
restricted to trials reporting do much larger than δ. Indeed, the final three rows show that
even some of the trials that commit “δ wobble” report very small P values.
Table 6: Frequency of values for the proportionate difference (do − δ)/δ, categorized
by magnitude of reported P value in statistically significant trials. Last row
contains those articles which did not report original designation of δ.

Percent difference
between do and δ       P<0.001   0.001<P<0.01   0.01<P<0.05   Total
>50%                         4              2             4      10
25% to 50%                   3              1             2       6
0% to 25%                    2              6             4      12
0%                           1              0             0       1
-25% to 0%                   0              6             4      10
-50% to -25%                 1              4             6      11
<-50%                        0              1             3       4
Total reporting δ           11             20            23      54
No δ described              16             15            13      44
Grand Total                 27             35            36      98
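
The tallies quoted in the text and in the Abstract can be re-derived from the cells of Table 6. The sketch below simply encodes the seven rows for trials that reported δ and sums the relevant cells; the variable and function names are hypothetical.

```python
# Rows of Table 6 for the 54 trials that reported delta.
# Each tuple holds the counts for P<0.001, 0.001<P<0.01, 0.01<P<0.05.
TABLE_6 = {
    ">50%":         (4, 2, 4),
    "25% to 50%":   (3, 1, 2),
    "0% to 25%":    (2, 6, 4),
    "0%":           (1, 0, 0),
    "-25% to 0%":   (0, 6, 4),
    "-50% to -25%": (1, 4, 6),
    "<-50%":        (0, 1, 3),
}

SMALL_P = (0, 1)     # column indexes for P values below 0.01
ANY_P = (0, 1, 2)    # all statistically significant results


def count(rows, columns) -> int:
    """Sum the cells of TABLE_6 at the given rows and column indexes."""
    return sum(TABLE_6[row][col] for row in rows for col in columns)


small_p_total = count(TABLE_6, SMALL_P)
much_larger = count((">50%", "25% to 50%"), SMALL_P)
smaller_than_delta = count(("-25% to 0%", "-50% to -25%", "<-50%"), SMALL_P)
much_smaller_any_p = count(("-50% to -25%", "<-50%"), ANY_P)

print(small_p_total, much_larger, small_p_total - much_larger)
print(smaller_than_delta, much_smaller_any_p)
```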



IV.  DISCUSSION 

A.  Reporting  of  Sample  Size  Calculations 

The current finding, that 43% (67/158) of a group of published randomized
controlled trials presented full details of sample size calculations, was only slightly higher
than the 39% found in a 1990 review [6] and was identical to the rate noted in an analogous
review published in 1994 [8]. The finding that 39% (58/158) of articles reported no details
of sample size calculation was also similar to the rate noted in a 1990 review [6].


Although the details of sample size calculation are now reported more often than
noted in the first such reviews almost 20 years ago [2], reporting is still far from complete.
In a 1994 review, Moher et al. suggested:

that authors should report sample size calculations and that
the following information should be contained in all
published reports of RCTs: (1) The primary dependent
measure(s) should be clearly identified. (2) A clinically
important treatment effect should be specified. (3) The
treatment effect should be clearly indicated as being an
absolute or a relative difference. (4) The statistical test,
directionality, α level, and statistical power used to estimate
sample size should be reported. [8, p 124]

This suggestion was reiterated in an article later in 1994 by The Standards of
Reporting Trials Group, an international committee established to address reporting of
randomized controlled trials [11]. In light of the consequences that are about to be
discussed, editors might follow the recommendations of Moher et al. and become more
demanding in asking authors to report their pre-trial assumptions when sample size was
calculated.

B.  Very  Small  P  Values 

As noted in Table 2, more than 25% of the trials with statistically significant results
reported P values that were <0.001. As shown in Equation 2, these excessively small P
values could have come from unexpectedly large values of d0, but Table 6 shows that only
10 (32%) of the 31 articles with P<0.01 found d0 to be substantially larger than δ. The
remainder of this section will show, however, that even for those 10 articles, the large d0
values were not likely to have caused the excessively small P values.

1. An Example of an Article with d0 Much Larger Than δ

In one article where δ was designated as 0.20 but d0 was found to be 0.37, this
distinction was reported as having a P<0.001 [12]. The Z-score for this result can be
calculated using Equation 2. In the article cited, pc was 0.77, pe was 0.40, nc was 104 and
ne was 95 [12]. p can be approximated as the weighted average of 0.77 and 0.40, which
yields 0.593*. The Z-score is thus:

$$ Z = \frac{0.77 - 0.40}{\sqrt{0.593\,(1 - 0.593)\left[\frac{1}{104} + \frac{1}{95}\right]}} = 5.31 \tag{3} $$

* This value is calculated as an average weighted by the number of subjects in each group:

$$ p = \frac{n_c p_c + n_e p_e}{n_c + n_e} = \frac{(104 \times 0.77) + (95 \times 0.40)}{104 + 95} = 0.593 $$


Since a Z-score of 3.80 corresponds to a P value of 1×10^-4, this finding would not only
result in P<0.001, but would yield an infinitesimally small P value in the range of
1.0×10^-7 [14, p 31].
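
The calculation in Equation 3 can be reproduced numerically. The following is a minimal sketch, assuming that Equation 2 is the usual pooled two-proportion Z test and that P values are read from the standard normal distribution; the function names `pooled_z` and `two_tailed_p` are introduced here for illustration only.

```python
import math

def pooled_z(p_c, p_e, n_c, n_e):
    """Z-score for a difference of two proportions, using the pooled proportion."""
    pooled = (n_c * p_c + n_e * p_e) / (n_c + n_e)   # weighted average of the two rates
    sed = math.sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_e))
    return (p_c - p_e) / sed

def two_tailed_p(z):
    """Two-tailed P value for a standard normal deviate."""
    return math.erfc(abs(z) / math.sqrt(2))

# Values reported in the cited article: pc = 0.77, pe = 0.40, nc = 104, ne = 95.
z_observed = pooled_z(0.77, 0.40, 104, 95)
p_observed = two_tailed_p(z_observed)
print(f"Z = {z_observed:.2f}, two-tailed P = {p_observed:.1e}")
```

Under these assumptions the sketch gives a Z-score of about 5.31 and a P value on the order of 10^-7, in line with the text.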


2. Effect of Discrepancy Between d0 and δ on P Values


If the d0 in the above example had been the expected 0.20 instead of 0.37, i.e., if pc
was 0.60, the Z-score would be calculated as follows:

$$ Z = \frac{0.60 - 0.40}{\sqrt{0.505\,(1 - 0.505)\left[\frac{1}{104} + \frac{1}{95}\right]}} = 2.82 \tag{4} $$


The corresponding P value would be less than 0.005 [15, p 281]. Although much larger than
the previous result for P, this value is still 10 times smaller than the boundary of 0.05 for
achieving statistical significance.


3. Effect of Discrepancy Between pc and πc on P Values


In the cited article, however, not only was d0 much larger than δ, but the observed
event rates themselves were considerably larger than the values assigned prior to the
study. The authors had previously estimated rates of 0.25 for the control group and 0.05
for the experimental group [12]. If these outcome rates had actually been found, the Z-score
calculation would have shown:

$$ Z = \frac{0.25 - 0.05}{\sqrt{0.155\,(1 - 0.155)\left[\frac{1}{104} + \frac{1}{95}\right]}} = 3.89, \tag{5} $$

which corresponds to a P value of 1×10^-4 [14, p 31].


In this example the discrepancy between the observed and estimated event rates
affected the variance term in the denominator. The closer p is to 0.50, the larger the
variance will be and the smaller the Z-score will be. The discrepancy between predicted
and reported rates in this example moved p very close to 0.50, but the P value was still
reported as statistically significant by a wide margin. The more striking point, however, is
that the Z-score calculated in Equation 5 represents the clinical outcome expected by the
authors, but the corresponding P value is extremely small. If the very small P value is due
neither to large d0 values nor to discrepancies between pc and πc, the only remaining cause
for the small P values is excessively large sample sizes.
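
The three calculations in Equations 3 to 5 can be placed side by side to show how the variance term p(1 - p) changes with the event rates. This is a sketch under the same assumption of a pooled two-proportion Z test; the helper `pooled_z` and the dictionary labels are illustrative, and the pooled proportion is computed without intermediate rounding, so the last digit may differ slightly from the text.

```python
import math

def pooled_z(p_c, p_e, n_c=104, n_e=95):
    """Return (Z, pooled proportion, variance term) for two observed proportions."""
    pooled = (n_c * p_c + n_e * p_e) / (n_c + n_e)
    variance_term = pooled * (1 - pooled)            # largest when pooled is near 0.50
    sed = math.sqrt(variance_term * (1 / n_c + 1 / n_e))
    return (p_c - p_e) / sed, pooled, variance_term

cases = {
    "observed rates (Equation 3)": (0.77, 0.40),
    "d0 equal to delta (Equation 4)": (0.60, 0.40),
    "rates assumed before the trial (Equation 5)": (0.25, 0.05),
}

results = {}
for label, (p_c, p_e) in cases.items():
    z, pooled, variance_term = pooled_z(p_c, p_e)
    results[label] = z
    print(f"{label}: pooled p = {pooled:.3f}, p(1-p) = {variance_term:.3f}, Z = {z:.2f}")
```

Equations 4 and 5 use the same difference of 0.20 and the same group sizes, yet the Z-score is larger in Equation 5 because the pooled rate lies farther from 0.50. All three Z-scores exceed 1.96.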


C. The Phenomenon of “δ Wobble”


In addition to showing that increased d0 values seldom cause very small P values,
Table 5 and Table 6 reveal an additional phenomenon. As noted in Section III.D, 25 of
the 54 articles in Table 5 reported d0 values smaller than δ; indeed 15 of the “statistically
significant” d0 values were at least 25% smaller than δ. If the δ value entered into the
Neyman-Pearson equation represents the minimum boundary to be regarded as clinically
significant, the frequent citation of “significance” for d0 values much smaller than the
pre-assigned δ calls into question the initial design of the trial. The remainder of this section
will show a hypothetical sample size calculation to demonstrate that the Neyman-Pearson
equation converts δ into a “wobbly” parameter. The calculated sample sizes will allow
values of d0 much smaller than δ to be declared statistically significant.

The values in the example below are chosen arbitrarily, but the results will hold for
any set of values if readers want to replicate the exercise. The Neyman-Pearson
calculation shown in Equation 1 in Section II.B is the standard method used for
calculating sample sizes. The elements of the calculation were described in Section II.B
and will not be repeated here.

1.  Calculation  of  the  Sample  Size 

For this example, I will assume that the mortality rate of 0.20 with current therapy
for disease X is to be tested against a new treatment that is expected to reduce the
mortality rate to 0.10. Following convention, the researchers designate an α value of 0.05
(two-tailed) and a β value of 0.10 (one-tailed) for the purposes of sample size calculation.
The sample size calculated using the Neyman-Pearson equation will be as follows:

$$ n \geq \frac{(1.96 + 1.282)^2 \times [2 \times 0.2 \times (1 - 0.2)]}{0.10^2} = 336 \tag{6} $$

The researchers therefore recruit a total of 672 patients to their study, 336 to receive
current therapy and 336 to receive experimental treatment. With this backdrop, I will now
consider possible outcomes.
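
The sample size in Equation 6 can be recomputed directly. The following is a minimal sketch of the Neyman-Pearson formula as it is used in this example; the function name `neyman_pearson_n` is an illustrative assumption.

```python
def neyman_pearson_n(z_alpha, z_beta, pi, delta):
    """Per-group sample size: (Za + Zb)^2 * [2 * pi * (1 - pi)] / delta^2."""
    return (z_alpha + z_beta) ** 2 * (2 * pi * (1 - pi)) / delta ** 2

# Control mortality 0.20, designated difference 0.10,
# two-tailed alpha of 0.05 (Z = 1.96), one-tailed beta of 0.10 (Z = 1.282).
n_exact = neyman_pearson_n(1.96, 1.282, 0.20, 0.10)
n_per_group = round(n_exact)   # the text reports the value rounded to the nearest integer
print(f"n per group = {n_exact:.2f}, reported as {n_per_group}")
```

The unrounded value is about 336.3. The text reports 336 per group; a convention of rounding up would give 337, which would not change the argument that follows.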


2. Scenario 1: d0 Equal to δ


In the first scenario, the researchers find exactly what they had expected.
Mortality in the control group is 67/336, or 0.199, while mortality in the experimental
group is 34/336, or 0.101. To test the statistical significance of these findings, the Z score
is calculated with Equation 2 to show:

$$ Z = \frac{0.199 - 0.101}{\sqrt{0.15\,(1 - 0.15)\left[\frac{1}{336} + \frac{1}{336}\right]}} = 3.56 \tag{7} $$


The corresponding two-tailed P value for this result is 0.0002 [15, p 280], which is much
smaller than the anticipated P=0.05, although the d0 found by the researchers almost
exactly equals the prior designation of δ.


3. Scenario 2: d0 Smaller than δ, but Statistically Significant

In the second scenario, mortality in the control group is again 67/336, or 0.199,
but mortality in the experimental group is 47/336, or 0.140. This d0 of 0.059 is much
smaller than what was hoped, but calculation of a Z score reveals:

$$ Z = \frac{0.199 - 0.140}{\sqrt{0.170\,(1 - 0.170)\left[\frac{1}{336} + \frac{1}{336}\right]}} = 2.04 \tag{8} $$


This corresponds to a P value of 0.0414 [15, p 280]. Armed with this result, the investigators
can now present this d0 of 0.059 as statistically significant, although it is almost half of the
δ value of 0.10 which they designated as a difference worth finding before the trial began.


4. Scenario 3: d0 Smaller than δ, but not Statistically Significant


In a third scenario, control group mortality remains at 67/336 (0.199), but one
additional experimental group patient dies, so that experimental group mortality is 48/336,
or 0.143. The d0 of 0.056 is again smaller than the prior designation of δ. Calculation of
the Z-score shows:

$$ Z = \frac{0.199 - 0.143}{\sqrt{0.171\,(1 - 0.171)\left[\frac{1}{336} + \frac{1}{336}\right]}} = 1.93 \tag{9} $$


This corresponds to a P value that is slightly greater than 0.05 [15, p 280], so that the
investigators cannot claim the result is significantly different from zero. Persevering, the
researchers recall that the Zβ term in the sample size calculation was 1.282, corresponding
to a one-tailed β of 0.10. A Z-score for the alternate hypothesis can be calculated using
the following formula:

$$ Z_A = \frac{\delta - d_0}{SED} \tag{10} $$

This is almost the same Z-score formula shown in Equation 2, but differing in the use of
the quantity δ - d0 in the numerator*, and it results in the following calculation:

$$ Z_A = \frac{0.10 - 0.056}{\sqrt{0.171\,(1 - 0.171)\left[\frac{1}{336} + \frac{1}{336}\right]}} \approx 1.5 \tag{11} $$


Since this value is greater than the threshold value of 1.282 used in the sample size
calculation, the investigators can reject the alternate hypothesis of a large difference
between groups. The investigators can now claim they have proven that there is no
important difference between the two treatments, as their results excluded a difference of
0.10.
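
The three scenarios can be recomputed together. The sketch below uses the rounded proportions quoted in the text and assumes the pooled standard error of Equation 2 for both the null and the alternate Z-scores; the function and constant names are illustrative.

```python
import math

N_PER_GROUP = 336   # per-group sample size from Equation 6
DELTA = 0.10        # difference designated before the trial
Z_ALPHA = 1.96      # two-tailed alpha of 0.05
Z_BETA = 1.282      # one-tailed beta of 0.10

def standard_error(p_c, p_e, n=N_PER_GROUP):
    """Pooled standard error of the difference for two equal-sized groups."""
    pooled = (p_c + p_e) / 2
    return math.sqrt(pooled * (1 - pooled) * (2 / n))

def z_null(p_c, p_e):
    """Z-score against the null hypothesis of no difference (Equation 2)."""
    return (p_c - p_e) / standard_error(p_c, p_e)

def z_alternate(p_c, p_e, delta=DELTA):
    """Z-score against the alternate hypothesis of a difference of delta (Equation 10)."""
    return (delta - (p_c - p_e)) / standard_error(p_c, p_e)

# (control rate, experimental rate) as rounded in the text
scenarios = {1: (0.199, 0.101), 2: (0.199, 0.140), 3: (0.199, 0.143)}
z_values = {k: z_null(*rates) for k, rates in scenarios.items()}
z_a_scenario_3 = z_alternate(*scenarios[3])

for k, z in z_values.items():
    verdict = "significant" if z >= Z_ALPHA else "not significant"
    print(f"Scenario {k}: Z = {z:.2f} ({verdict} at two-tailed 0.05)")
print(f"Scenario 3: Z_A = {z_a_scenario_3:.2f} (threshold {Z_BETA})")
```

With these inputs the Z-scores are about 3.56, 2.04 and 1.93, and the alternate-hypothesis Z-score for the third scenario is above 1.282.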


5.  A  Zone  of  Double  Significance 


A particularly interesting result arises, however, if we return to the scenario in
Section IV.C.3, and to the Z-score calculated in Equation 8. Suppose that, for that same
result, the researchers had calculated a ZA for the alternative hypothesis:

$$ Z_A = \frac{0.10 - 0.059}{\sqrt{0.170\,(1 - 0.170)\left[\frac{1}{336} + \frac{1}{336}\right]}} \approx 1.4 \tag{12} $$


* The SED for the alternate hypothesis is properly calculated as follows:


seda  = 


11 


In this example, both the above calculation and the calculation shown in the denominator of Equation 11
produce a value of 0.029. The standard calculation of SED will be used for the remaining examples.

This result is greater than the prior Zβ value of 1.282, and the investigators can declare
that they have proven that there is no important difference between groups, as this result
excludes a difference of 0.10. For this result, therefore, the investigators have achieved
double statistical significance. The difference between groups is both statistically
significantly greater than zero (Equation 8) and is also statistically significantly smaller
than the difference initially defined as clinically significant (Equation 12).

6.  Neyman-Pearson  Equation  Shifts  Threshold  for  Statistical  Significance 

Although the investigators stated initially that a difference of 0.10 between
treatments represented the boundary for clinical importance, the sample size calculated
with the Neyman-Pearson equation would in fact allow them to declare statistical
significance for a d0 as small as 0.059, or any larger value. At that same level of d0, and at
any smaller value, the investigators could declare that their result was statistically
significantly smaller than 0.10. The crucial observation here is that when sample size is
calculated with the Neyman-Pearson equation for this example, the investigators will find
a statistically significant result no matter what value emerges for d0. The particular values
shown in Section IV.C.5 will even achieve double significance; thus, the threshold for
significance is no longer the clinical threshold that was used in trial design. In addition,
when the final result does equal the threshold defined as clinically significant for trial
design, the P value is orders of magnitude smaller than the 0.05 value that conventionally
determines statistical significance.
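
The claim that some form of statistical significance is reached whatever d0 emerges can be checked by enumeration. The sketch below holds control deaths at 67 of 336, as in the scenarios above, and scans every number of experimental-group deaths from 0 to 67. It is an illustration under stated assumptions: exact fractions are used instead of the rounded proportions in the text, so Z-scores differ slightly in the second decimal place, and the names are hypothetical.

```python
import math

N = 336               # patients per group
CONTROL_DEATHS = 67   # fixed control-group deaths, as in the scenarios
DELTA = 0.10
Z_ALPHA = 1.96        # threshold for "greater than zero"
Z_BETA = 1.282        # threshold for "smaller than delta"

def classify(exp_deaths):
    """Return (d0, significantly above zero?, significantly below delta?)."""
    d0 = (CONTROL_DEATHS - exp_deaths) / N
    pooled = (CONTROL_DEATHS + exp_deaths) / (2 * N)
    sed = math.sqrt(pooled * (1 - pooled) * (2 / N))
    return d0, d0 / sed >= Z_ALPHA, (DELTA - d0) / sed >= Z_BETA

rows = [(deaths, *classify(deaths)) for deaths in range(0, CONTROL_DEATHS + 1)]
neither = [deaths for deaths, _, above, below in rows if not (above or below)]
double = [deaths for deaths, _, above, below in rows if above and below]

print("outcomes with no significant result:", neither)
print("outcomes with double significance:", double)
```

Under these assumptions no outcome in the scanned range escapes significance in at least one direction, and two outcomes (46 and 47 experimental-group deaths, the latter being Scenario 2) fall in the zone of double significance.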

D. Calculation of Implicit Thresholds

The example in the preceding section demonstrated some of the consequences of
using a sample size generated by the Neyman-Pearson calculation. In this section I will
show how those consequences arise. The crux of this argument is that the
Neyman-Pearson calculation uses both Zα and Zβ, thereby incorporating both null and alternate
hypotheses in one formula. In the analysis of results, however, these hypotheses are
evaluated separately, each with an individual calculation.


If  one  calculated  a  sample  size  considering  only  the  possibility  of  type  I  error,  the 
sample  size  calculation  would  be  as  follows: 


$$ n \geq \frac{(Z_\alpha)^2 \times [2 \times \pi \times (1 - \pi)]}{\delta^2} \tag{13} $$


This differs from the Neyman-Pearson calculation in that Zβ is not included. Conversely, if
one calculated a sample size with concern for type II error only, the following equation
would be used:

$$ n \geq \frac{(Z_\beta)^2 \times [2 \times \pi \times (1 - \pi)]}{(\delta - d_0)^2} \tag{14} $$

In this case, Zα has been excluded from the numerator. Additionally, the denominator
term (δ - d0) reflects the increment at which a difference of δ will be ruled out at the Zβ
level of significance.

Combining the previous two equations with the results of the example from the
previous section, I will now demonstrate that the thresholds for statistical significance are
not those demarcated in the Neyman-Pearson equation. In Equation 6, the calculated
sample size was 336 patients per group. With this sample size, consider Equation 13.
Inserting 336 for n, 1.96 for Zα, and keeping the same term in the denominator, the result
is as follows:

$$ 336 = \frac{(1.96)^2 \times [2 \times 0.2 \times (1 - 0.2)]}{\delta^2} \tag{15} $$


Rearranging Equation 15 to solve for δ produces:

$$ \delta = \sqrt{\frac{(1.96)^2 \times [2 \times 0.2 \times (1 - 0.2)]}{336}} = 0.060 \tag{16} $$


The implication of the above calculation is that although a δ value of 0.10 was designated
when the Neyman-Pearson equation was used to calculate sample size, the result in fact
represents an implicit δ designation of 0.060. In other words, although the study design
designates a difference of 0.10 as clinically important, differences as small as 0.060 will be
found to be statistically significant. This result explains both the findings summarized in
Section III.D and the outcome of the example trial presented in Section IV.C.


The same process can be utilized for the calculation shown in Equation 14. Again
combining that equation with the result of Equation 6, but omitting the intermediate steps,
the final result will be as follows:

$$ \delta - d_0 = \sqrt{\frac{(1.282)^2 \times 2 \times 0.2 \times (1 - 0.2)}{336}} = 0.040 \tag{17} $$


Since the prior designation of δ in this example was 0.10, this result shows that for any d0
value of 0.060 or smaller, the ZA value for the alternate hypothesis will be greater than
1.282 and the alternate hypothesis of a large difference between control and experimental
groups will be rejected.
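
Both implicit thresholds can be obtained from one small helper. This is a sketch of the rearrangements in Equations 16 and 17, assuming the control-group variance term 2π(1 - π) from Equation 1; the function name `implicit_increment` is illustrative.

```python
import math

def implicit_increment(z, pi, n):
    """Difference detectable at Z-level z with n per group: z * sqrt(2*pi*(1-pi)/n)."""
    return z * math.sqrt(2 * pi * (1 - pi) / n)

implicit_delta = implicit_increment(1.96, 0.20, 336)    # Equation 16
exclusion_gap = implicit_increment(1.282, 0.20, 336)    # Equation 17, i.e. delta - d0
d0_ceiling = 0.10 - exclusion_gap                       # largest d0 that still excludes 0.10

print(f"implicit delta = {implicit_delta:.3f}")
print(f"delta - d0     = {exclusion_gap:.3f}")
print(f"d0 ceiling     = {d0_ceiling:.3f}")
```

The two thresholds land at essentially the same value, about 0.060, which is the point at which the verdict switches from one kind of significance to the other.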

The results of Equations 16 and 17 converge around a d0 value of 0.060. The
preceding paragraphs show that for the sample size determined with the Neyman-Pearson
calculation in Equation 6, any value larger than 0.060 will be statistically significantly
greater than 0, while any value smaller than 0.060 will be statistically significantly smaller
than 0.10. In fact, since the SED is used for Z-score calculation (see Equation 2) and the
variance of the control group rate is used in the standard form of the Neyman-Pearson
formula (see Equation 1), the actual numbers in Section IV.C stretch further, so that
values somewhat smaller than 0.060 will be statistically significantly greater than 0, and
some of these values will also be statistically significantly smaller than 0.10. In practical
terms, then, the Neyman-Pearson sample size calculation has guaranteed a statistically
significant result in one direction or the other, and has even created a zone of double
significance. The value 0.060 has now become the threshold at which the therapy being
studied will be declared effective or ineffective, although this value is barely half of the
value originally designated by the investigators. This result is not particular to the
numbers chosen for this example, and will be found with any other set of values that are
chosen for the illustration.
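
The generality of this result can be seen algebraically. The following is a sketch under two simplifying assumptions: that the sample size equals the Neyman-Pearson value of Equation 1 exactly, without rounding, and that the same variance term 2π(1 - π) is used in the analysis as in the design. The symbol s is introduced here only as shorthand.

$$ s = \sqrt{\frac{2\pi(1-\pi)}{n}}, \qquad n = \frac{(Z_\alpha + Z_\beta)^2 \times 2\pi(1-\pi)}{\delta^2} \quad\Longrightarrow\quad s = \frac{\delta}{Z_\alpha + Z_\beta} $$

$$ Z_\alpha \, s = \frac{Z_\alpha}{Z_\alpha + Z_\beta}\,\delta, \qquad \delta - Z_\beta \, s = \delta - \frac{Z_\beta}{Z_\alpha + Z_\beta}\,\delta = \frac{Z_\alpha}{Z_\alpha + Z_\beta}\,\delta $$

The left-hand expression is the implicit threshold of Equation 16 and the right-hand expression is the d0 ceiling implied by Equation 17; under these assumptions they are the same quantity. With Zα = 1.96 and Zβ = 1.282 the ratio is 1.96/3.242, or about 0.605, so the two thresholds meet at roughly 0.605δ whatever values of π and δ are chosen, which is 0.060 when δ is 0.10.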

V. CONCLUSION


This paper has attempted to show consequences of excess sample size in large
randomized controlled trials. Since a landmark paper by Freiman et al. almost 20 years
ago, most discussion of sample sizes has focused on the need for larger samples. The
results of the present study, however, suggest that enlargement of sample sizes, and in
particular the use of the Neyman-Pearson equation to calculate these large samples, may
have two important unintended consequences.

The first consequence is that with the large sample sizes generated by the
Neyman-Pearson equation, many results will produce extremely small P values, despite the general
acceptance of 0.05 as the threshold for a statistically significant finding. As noted in the
introduction, an overly large sample generates excess cost and requires an excessive
number of patients. As research funding dwindles, the excessive costs of oversized trials
represent a substantial overuse of resources.

The second consequence of sample sizes calculated with the Neyman-Pearson
equation is that the quantitative threshold for an impressive difference is reduced to almost
half of the initial level. Thus, whatever δ investigators designate at the outset of the trial,
the d0 that can be declared statistically significant will be considerably smaller than the
original value of δ; we have named this phenomenon “δ wobble.” The “wobbliness” of the
boundary for clinical significance defined for sample size calculation undermines the
clinical judgment used in originally defining δ.

The problems of excessively small P values and “δ wobble” suggest that the
Neyman-Pearson strategy of sample size calculation requires serious reevaluation. It is
beyond the scope of this paper to suggest alternative methods for sample size calculation.
Until such methods are developed, however, editors can address the problem of “δ
wobble” by requiring investigators to state both δ and d0, and to justify reporting of
statistically significant d0 values which are smaller than the δ initially designated as
clinically significant.